RAG Hallucination Examples

How are our current metrics defined?

Retrieval relevance
- The judge model evaluates whether the retrieved context is relevant to the user prompt. If the context contains any relevant information or keywords from the user prompt, it is marked as “GOOD”. Otherwise, we mark it as “BAD”.
Faithfulness
- The judge model evaluates whether the final generated response is based on the retrieved context. If any portion of the response is hallucinated, or if the response is not completely based on the context, we mark it as “BAD”. Otherwise, we mark it as “GOOD”.
Response relevance
- The judge model evaluates whether the final generated response is relevant to the original user prompt. As long as the response is a reasonable answer to the user prompt, with relevant keywords and information, we mark it as “GOOD”. Otherwise, we mark it as “BAD”.
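To make the shape of these binary metrics concrete, here is a minimal Python sketch of how one judged interaction could be recorded. The names `BinaryLabel`, `RagEvaluation`, and `failed_metrics` are hypothetical and chosen only for illustration; they are not part of any actual API.

```python
from dataclasses import dataclass
from enum import Enum


class BinaryLabel(Enum):
    """The two labels each current metric can take."""
    GOOD = "GOOD"
    BAD = "BAD"


@dataclass
class RagEvaluation:
    """Hypothetical container for one judged RAG interaction."""
    retrieval_relevance: BinaryLabel
    faithfulness: BinaryLabel
    response_relevance: BinaryLabel

    def failed_metrics(self) -> list[str]:
        """Return the names of the metrics that were marked BAD."""
        return [
            name
            for name, label in vars(self).items()
            if label is BinaryLabel.BAD
        ]
```

For example, a response that is on topic but not grounded in the retrieved context would carry `faithfulness=BinaryLabel.BAD` while the other two metrics stay `GOOD`.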

Next iteration of labels we are working on

Currently, each of our RAG metrics has only two labels, “GOOD” and “BAD”. We are working on making this taxonomy more fine-grained to accommodate more real-world scenarios. The sections below explain our new approach:

Retrieval Relevance would be broken down into the following labels:

Poor Retrieval: The retrieved context does not contain any facts relevant to the question
Insufficient Retrieval: The retrieved context contains some relevant facts, but none of them can be used to answer the question
Partial Retrieval: The retrieved context contains relevant facts, but they are only enough to answer part of the question
Good Retrieval: The retrieved context contains all the facts necessary to answer the question
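The four retrieval labels above can be read as a decision rule over simple counts. The sketch below is one possible reading, not the judge model's actual logic: it assumes someone (a human or a judge model) has already counted how many facts in the context relate to the question and how many of the facts needed for an answer were found. The function and parameter names are hypothetical.

```python
from enum import Enum


class RetrievalLabel(Enum):
    POOR = "Poor Retrieval"
    INSUFFICIENT = "Insufficient Retrieval"
    PARTIAL = "Partial Retrieval"
    GOOD = "Good Retrieval"


def classify_retrieval(
    relevant_facts: int, needed_found: int, needed_total: int
) -> RetrievalLabel:
    """Map illustrative counts to a retrieval label.

    relevant_facts: facts in the context that relate to the question.
    needed_found:   facts required for the answer that the context contains.
    needed_total:   facts required to answer the question in full.
    """
    if needed_total <= 0 or not 0 <= needed_found <= needed_total:
        raise ValueError("expected 0 <= needed_found <= needed_total, needed_total > 0")
    if relevant_facts < needed_found:
        raise ValueError("every needed fact that was found is also a relevant fact")

    if relevant_facts == 0:
        return RetrievalLabel.POOR           # nothing relevant at all
    if needed_found == 0:
        return RetrievalLabel.INSUFFICIENT   # relevant, but unusable for an answer
    if needed_found < needed_total:
        return RetrievalLabel.PARTIAL        # answers only part of the question
    return RetrievalLabel.GOOD               # everything needed is present
```

Under these assumptions, `classify_retrieval(3, 0, 2)` falls into Insufficient Retrieval: the context is on topic but contains nothing that answers the question.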

Faithfulness would be broken down into the following labels:

Refusal: The response is just a refusal
Unfaithful Response: No part of the response is supported by the context
Partially Faithful Response: Only some parts of the response are supported by the context
Fully Faithful: All parts of the response are supported by the context
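As a rough illustration of the faithfulness labels, the sketch below assumes the response has already been split into individual claims and that each claim has been checked against the context. The claim counting, the function name, and the parameters are assumptions made for this example; the labels themselves come from the list above.

```python
from enum import Enum


class FaithfulnessLabel(Enum):
    REFUSAL = "Refusal"
    UNFAITHFUL = "Unfaithful Response"
    PARTIALLY_FAITHFUL = "Partially Faithful Response"
    FULLY_FAITHFUL = "Fully Faithful"


def classify_faithfulness(
    is_refusal: bool, supported_claims: int, total_claims: int
) -> FaithfulnessLabel:
    """Map illustrative claim counts to a faithfulness label.

    supported_claims: claims in the response that the context supports.
    total_claims:     all claims made in the response.
    """
    # A refusal is labeled first; it makes no claims to check.
    if is_refusal:
        return FaithfulnessLabel.REFUSAL
    if total_claims <= 0 or not 0 <= supported_claims <= total_claims:
        raise ValueError("expected 0 <= supported_claims <= total_claims, total_claims > 0")

    if supported_claims == 0:
        return FaithfulnessLabel.UNFAITHFUL
    if supported_claims < total_claims:
        return FaithfulnessLabel.PARTIALLY_FAITHFUL
    return FaithfulnessLabel.FULLY_FAITHFUL
```

A response with four claims, three of which are backed by the context, would be labeled Partially Faithful Response under this reading.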

Response relevance would be broken down into the following labels:

Refusal: The response is just a refusal
Irrelevant Response: No part of the response is relevant to the question
Partially Relevant Response: The response contains some parts that are related to the question, but also contains other non-relevant parts
Incomplete Response: All parts of the response are relevant, but the response does not answer every part of the question
Complete and Relevant: The response satisfactorily answers the question
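The response relevance labels combine two checks: how much of the response is on topic, and how much of the question it covers. The sketch below is one way to express that ordering, assuming the response and the question have each been broken into countable parts. All names and the counting scheme are hypothetical.

```python
from enum import Enum


class ResponseRelevanceLabel(Enum):
    REFUSAL = "Refusal"
    IRRELEVANT = "Irrelevant Response"
    PARTIALLY_RELEVANT = "Partially Relevant Response"
    INCOMPLETE = "Incomplete Response"
    COMPLETE = "Complete and Relevant"


def classify_response(
    is_refusal: bool,
    relevant_parts: int,
    response_parts: int,
    answered_parts: int,
    question_parts: int,
) -> ResponseRelevanceLabel:
    """Map illustrative counts to a response relevance label.

    relevant_parts / response_parts: how much of the response is on topic.
    answered_parts / question_parts: how much of the question is answered.
    """
    if is_refusal:
        return ResponseRelevanceLabel.REFUSAL
    if response_parts <= 0 or not 0 <= relevant_parts <= response_parts:
        raise ValueError("expected 0 <= relevant_parts <= response_parts, response_parts > 0")
    if question_parts <= 0 or not 0 <= answered_parts <= question_parts:
        raise ValueError("expected 0 <= answered_parts <= question_parts, question_parts > 0")

    if relevant_parts == 0:
        return ResponseRelevanceLabel.IRRELEVANT
    if relevant_parts < response_parts:
        return ResponseRelevanceLabel.PARTIALLY_RELEVANT  # mixed with off-topic parts
    if answered_parts < question_parts:
        return ResponseRelevanceLabel.INCOMPLETE          # on topic, but gaps remain
    return ResponseRelevanceLabel.COMPLETE
```

Note the design choice in this sketch: off-topic content is checked before completeness, so a response that answers everything but also includes unrelated material is labeled Partially Relevant Response rather than Complete and Relevant.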

Retrieval Relevance Examples

Retrieval relevance

Faithfulness Examples

Faithfulness

Response Relevance Examples

Response relevance
